Employing a Multilingual Transformer Model for Segmenting Unpunctuated Arabic Text
Abstract
Long unpunctuated texts containing complex linguistic sentences are a stumbling block to processing any low-resource language. Thus, approaches that attempt to segment lengthy texts with no proper punctuation into simple candidate sentences are a vitally important preprocessing task in many hard-to-solve NLP applications. To this end, we propose a solution for segmenting unpunctuated Arabic texts into potentially independent clauses. This solution consists of: (1) a punctuation detection model built on top of a multilingual BERT-based model, and (2) some generic rules for validating the resulting segmentation. Furthermore, we optimize the strategy of applying these rules using our suggested greedy-like algorithm. We call the proposed solution PDTS (standing for Punctuation Detector for Text Segmentation). Concerning the evaluation, we showcase how PDTS can be effectively employed as a text tokenizer for unpunctuated documents (i.e., mimicking transcribed audio-to-text documents). Experimental findings across two evaluation protocols (involving an ablation study and a human-based judgment) demonstrate that PDTS is practically effective in both performance quality and computational cost. In particular, PDTS can reach an average F-Measure score of approximately 75%, indicating a minimum improvement of roughly 13% (compared to state-of-the-art competitor models).
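The abstract describes a two-stage idea: a model proposes punctuation positions, and rules validate the resulting segments under a greedy-like strategy. The following is only an illustrative sketch of that general idea, not the authors' PDTS implementation. The function names `greedy_segment` and `min_length_rule`, the per-token `boundary_scores` input (assumed to come from some punctuation detection model), the 0.5 threshold, and the minimum-length rule are all hypothetical placeholders; the actual rules and algorithm are not given in this record.

```python
from typing import Callable, List, Sequence


def min_length_rule(segment: List[str], min_tokens: int = 3) -> bool:
    """Hypothetical validation rule: a segment needs at least `min_tokens` tokens."""
    return len(segment) >= min_tokens


def greedy_segment(
    tokens: Sequence[str],
    boundary_scores: Sequence[float],
    is_valid_segment: Callable[[List[str]], bool] = min_length_rule,
    threshold: float = 0.5,
) -> List[List[str]]:
    """Greedily cut `tokens` into segments, scanning left to right.

    boundary_scores[i] is an assumed model score that a sentence-ending
    punctuation mark follows tokens[i]. A cut is accepted only when the score
    reaches `threshold` AND the segment being closed passes the rule check;
    otherwise the tokens are carried over into the next candidate segment.
    """
    if len(tokens) != len(boundary_scores):
        raise ValueError("tokens and boundary_scores must have the same length")

    segments: List[List[str]] = []
    current: List[str] = []
    for token, score in zip(tokens, boundary_scores):
        current.append(token)
        if score >= threshold and is_valid_segment(current):
            segments.append(current)
            current = []
    if current:
        # Trailing tokens with no accepted boundary form the last segment.
        segments.append(current)
    return segments


if __name__ == "__main__":
    toks = ["a", "b", "c", "d", "e", "f", "g"]
    scores = [0.1, 0.9, 0.2, 0.8, 0.1, 0.1, 0.7]
    print(greedy_segment(toks, scores))
```

In this toy run the high-scoring boundary after the second token is rejected because the segment would be too short, so the first accepted cut falls after the fourth token. A real system would replace the stub scores with the output of a token-classification model and the length check with language-specific rules.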
Similar resources
A Recognition-Based Approach to Segmenting Arabic Handwritten Text
Segmenting Arabic handwritings had been one of the subjects of research in the field of Arabic character recognition for more than 25 years. The majority of reported segmentation techniques share a critical shortcoming, which is over-segmentation. The aim of segmentation is to produce the letters (segments) of a handwritten word. When a resulting letter (segment) is made of more than one piece ...
Segmenting Arabic Handwritten Documents into Text lines and Words
In this paper, we present a method for segmenting Arabic handwritten documents into text lines and words. Text line segmentation is addressed by a well-known technique, the horizontal projection profile, in which autocorrelation is used to enhance the self similarity of this profile. This technique promotes the estimation of text line spacing. Word extraction is based on an adaptation of a know...
High capacity steganography tool for Arabic text using 'Kashida'
Steganography is the ability to hide secret information in a cover-media such as sound, pictures and text. A new approach is proposed to hide a secret into Arabic text cover media using "Kashida", an Arabic extension character. The proposed approach is an attempt to maximize the use of "Kashida" to hide more information in Arabic text cover-media. To approach this, some algorithms have been des...
Multilayer model for Arabic text compression
This article describes a multilayer model-based approach for text compression. It uses linguistic information to develop a multilayer decomposition model of the text in order to achieve better compression. This new approach is illustrated for the case of the Arabic language, where the majority of words are generated according to the Semitic root-and-pattern scheme. Text is split into three ling...
Journal
Journal title: Applied Sciences
Year: 2022
ISSN: 2076-3417
DOI: https://doi.org/10.3390/app122010559